Abstract

Normalized web distance (NWD) is a similarity or normalized semantic distance based on the 
World Wide Web or any other large electronic database, for instance Wikipedia, and a search engine 
that returns reliable aggregate page counts. For sets of search terms the NWD gives a similarity on a 
scale from 0 (identical) to 1 (completely different). The NWD approximates the similarity according to 
all (upper semi)computable properties. We develop the theory and give applications. The derivation of 
the NWD method is based on Kolmogorov complexity. 

Index Terms — Normalized web distance, pattern recognition, data mining, similarity, classification, 
Kolmogorov complexity.


I. Introduction 

Commonly objects are computer files that carry all their properties in themselves. However, 
there are also objects that are given by name, such as ‘red,’ ‘three,’ ‘Einstein,’ or ‘chair.’ Such 
objects acquire their meaning from the common knowledge of mankind. We can give objects 
either as the object itself or as the name of that object, such as the literal text of the work 
“Macbeth by Shakespeare” or the name “Macbeth by Shakespeare.” We focus on the name case 
using the background information provided by the World Wide Web, or another data base such 
as Wikipedia, and a search engine that produces reliable aggregate page counts. The frequencies 
involved enable us to compute a distance for each set of names. The normalized form of this 
distance expresses similarity, that is, the search engine discovers the “meaning” names have 
in common. Insofar as the meaning of names on the data base as discovered by this process 
approximates the meaning of those names in human society, the above distance expresses the
common semantics of the names. The term “name” is used here synonymously with “word,”
“search term,” or “query.” The normalized distance above is called the normalized web distance
(NWD). We apply it in classification. 


Example 1.1. Although Google gives notoriously unreliable counts it serves well enough for
an example. On our scale of similarity, if NWD(X) = 0 then the search terms in the set X
are identical, and if NWD(X) = 1 then the search terms in X are as different as can be. On
19 August 2014 searching for “Shakespeare” gave 124,000,000 hits; searching for “Macbeth”
gave 22,400,000 hits; searching for “Hamlet” gave 51,300,000 hits; searching for “Shakespeare
Macbeth” gave 7,730,000 hits; searching for “Shakespeare Hamlet” gave 18,500,000 hits; and
searching for “Shakespeare Macbeth Hamlet” gave 663,000 hits. The number of web pages
returned by Google was estimated by searching for “the” as 25,270,000,000. By (II.3) we
have NWD({Shakespeare, Macbeth}) ≈ 0.395, NWD({Shakespeare, Hamlet}) ≈ 0.306,
and NWD({Shakespeare, Macbeth, Hamlet}) ≈ 0.372. We conclude that Shakespeare and
Hamlet have a lot in common, Shakespeare and Macbeth have a lot in common, and the
commonality of Shakespeare, Hamlet, and Macbeth is intermediate between the two. ◊
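
The arithmetic of Example 1.1 can be reproduced with a few lines of code. The following is a minimal sketch in Python, assuming the frequency form of (II.3) and taking the page counts quoted in the example as given; the function name `nwd` and its parameter names are illustrative choices and are not taken from the authors' implementation.

```python
import math

def nwd(singleton_counts, joint_count, total):
    """Normalized web distance from page counts (frequency form of (II.3)).

    singleton_counts: page counts f(x), one per search term x in the set X
    joint_count:      page count f(X) of pages containing all terms of X
    total:            normalizing count N (here the count for "the")
    """
    if len(singleton_counts) < 2:
        raise ValueError("need at least two search terms")
    if joint_count <= 0:
        raise ValueError("NWD is undefined when f(X) = 0")
    logs = [math.log2(c) for c in singleton_counts]
    numerator = max(logs) - math.log2(joint_count)
    denominator = (math.log2(total) - min(logs)) * (len(singleton_counts) - 1)
    return numerator / denominator

# Counts quoted in Example 1.1.
N = 25_270_000_000
shakespeare, macbeth, hamlet = 124_000_000, 22_400_000, 51_300_000

print(round(nwd([shakespeare, macbeth], 7_730_000, N), 3))        # 0.395
print(round(nwd([shakespeare, hamlet], 18_500_000, N), 3))        # 0.307
print(round(nwd([shakespeare, macbeth, hamlet], 663_000, N), 3))  # 0.372
```

The second value rounds to 0.307, whereas the example quotes 0.306; the difference is consistent with truncating rather than rounding the third decimal. The base of the logarithm cancels in the ratio, so any base gives the same result.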


To develop the theory behind the NWD we consider the information in individual objects. 
These objects are finite and expressed as finite binary strings. The classic notion of Kolmogorov 
complexity [?] is an objective measure for the information in a single object, and information
distance measures the information between a pair of objects [?]. This raises the question of
the shared information between many objects instead of just a pair of objects.


A. Related Work 

The similarity or relative semantics between pairs of search terms was defined in [5] and
demonstrated in practice by using the World Wide Web as database and Google as search
engine. The proposed normalized Google distance (NGD) works for any search engine that
gives an aggregate page count for search terms. See for example [13], [?], [?], [20], [3] and the
many references to [5] in Google Scholar.

In [?] the notion is introduced of the information required to go from any object in a finite
multiset (a set where a member can occur more than once) of objects to any other object in the set.
Let $X$ denote a finite multiset of $n$ finite binary strings defined by $\{x_1, \ldots, x_n\}$, the constituting
elements ordered length-increasing lexicographic. The information distance in $X$ is defined by
$E_{\max}(X) = \min\{|p| : U(x_i, p, j) = x_j \text{ for all } x_i, x_j \in X\}$. For instance, with $X = \{x, y\}$ the
quantity $E_{\max}(X)$ is the least number of bits in a program to transform $x$ to $y$ and $y$ to $x$. In
[18] the mathematical theory is developed further and the difficulty of normalization is shown.


B. Results 


The NWD is a similarity (a common semantics) between all search terms in a set. (We use 
set rather than multiset since a set is more appropriate in the context of search terms.) It can 
be thought of as a diameter of the set. For sets of cardinality two this diameter reduces to 
a distance between the two elements of the set. The NWD can be used for the classification 
of an unseen item into one of several classes (sets of names or phrases). This is simpler and 
computationally much easier than constructing the classes from the pairwise distances. In the
latter solution information is inevitably lost.

The basic concepts like the web events, web distribution, and web code are given in Section II.
We determine the length of a single shortest binary program to compute from any web event of
a single member in a set to the web event associated with the whole set (Theorem II.5). The
mentioned length is an absolute information distance associated with the set. It is incomputable
(Lemma II.4). However, for different sets it can be large while a set has similar members and
small when a (different) set has dissimilar members. Therefore we normalize on a scale from 0 to
1 to express the information distance or similarity between members of the set. We approximate
the incomputable normalized version with the computable NWD (Definition II.6). In Section III
we present properties of the NWD such as the range from 0 to 1 (Lemma III.1), whether and how
it changes under adding members (Lemma III.2), and that it does not satisfy the triangle inequality
and hence is not a metric (Lemma III.5). Theorem III.7 and Corollary III.8 show that the NWD
approximates the common similarity of the queries in a set of search terms (that is, a common
semantics). We subsequently apply the NWD to various data sets based on search results from
Amazon, Wikipedia and the National Center for Biotechnology Information (NCBI) website from
the U.S. National Institutes of Health in Section IV. We treat strings and self-delimiting strings
in Appendix A, computability notions in Appendix B, Kolmogorov complexity in Appendix C,
and metric of sets in Appendix D. The proofs are deferred to Appendix E.


II. Web Distribution and Web Code

We give a derivation that holds for idealized search engines that return reliable aggregate page 
counts from their data bases (here called the web consisting of web pages). Subsequently we 
apply the idealized theory to real problems using real search engines on real data bases. 

A. Web Event 

The set of singleton search terms is denoted by $S$, a set of search terms is $X = \{x_1, \ldots, x_n\}$
with $x_i \in S$ for $1 \le i \le n < \infty$, and $\mathcal{X}$ denotes the set of such $X$. Let the set of web pages
indexed (capable of being returned) by the search engine be $\Omega$.

Definition II.1. We define the web event $e(X) \subseteq \Omega$ by the set of web pages returned by the
search engine doing a search for $X$ such that each web page in the set contains occurrences of
all elements from $X$.

If $x, y \in S$ and $e(x) = e(y)$ then $x \sim y$ and the equivalence class $[x] = \{y \in S : y \sim x\}$.
Unless otherwise stated, we consider all singleton search terms that define the same web event
as the same term. Hence we deal actually with equivalence classes $[x]$ rather than $x$. However,
for ease of notation we write $x$ in the sequel and consider this to mean $[x]$.

If $X = \{x_1, \ldots, x_n\}$, then $e(X) = e(x_1) \cap \cdots \cap e(x_n)$ and the frequency $f(X) = |e(X)|$.

The web event e(X) embodies all direct context in which all elements from X simultaneously 
occur in these web pages. Therefore web events capture in the outlined sense all background 
knowledge about this combination of search terms on the web. 
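
To make the notion of a web event concrete, the following minimal Python sketch models a tiny "web" as a dictionary from page identifiers to the words on each page. The data and the helper names `web_event` and `frequency` are made up for illustration; a real search engine returns only the count, not the set of pages.

```python
def web_event(terms, pages):
    """Return e(X): ids of the pages that contain every term in `terms`."""
    return {page_id for page_id, words in pages.items()
            if all(term in words for term in terms)}

def frequency(terms, pages):
    """Return f(X) = |e(X)|."""
    return len(web_event(terms, pages))

# Toy web: page id -> set of words on that page (hypothetical data).
pages = {
    1: {"shakespeare", "macbeth", "play"},
    2: {"shakespeare", "hamlet"},
    3: {"hamlet", "denmark"},
    4: {"shakespeare", "macbeth", "hamlet"},
}

e_shakespeare = web_event({"shakespeare"}, pages)
e_macbeth = web_event({"macbeth"}, pages)

# e(X) is the intersection of the web events of the members of X.
assert web_event({"shakespeare", "macbeth"}, pages) == e_shakespeare & e_macbeth
print(frequency({"shakespeare", "macbeth"}, pages))  # 2 (pages 1 and 4)
```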

B. The Web Code 

It is natural to consider code words for events. We base those code words on the probability 
of the event. Consider the set 

$$T_{\omega,s} = \{(\omega, s) : \omega \in \Omega,\ s \in S,\ s \text{ occurs in } \omega\}.$$

Then $\alpha = \sum_{s \in S} |T_{\omega,s}| / |\Omega|$ is the average number of search terms per web page in $\Omega$. Define
the probability $g(X)$ of $X$ as $g(X) = f(X)/N$ with $N = \alpha|\Omega|$. This probability may change
over time, but let us imagine that the probability holds in the sense of an instantaneous snapshot.
A probability mass function on a known set allows us to define the associated prefix-code
word length (information content) equal to uniquely decodable code word length [?], [13]. Such
a prefix code is a code such that no code word is a proper prefix of any other code word. By
the ubiquitous Kraft inequality [?], if $l_1, l_2, \ldots$ is a sequence of positive integers satisfying

$$\sum_i 2^{-l_i} \le 1, \tag{II.1}$$

then there is a set of prefix-code words of length $l_1, l_2, \ldots$. Conversely, if there is a set of
prefix-code words of length $l_1, l_2, \ldots$ then these lengths satisfy the above displayed equation. By
the fact that the probabilities of a discrete set sum to at most 1, every web event $e(X)$ having
probability $g(X)$ can be encoded in a prefix-code word.

Definition II.2. The length $G(X)$ of the web code word for $X \in \mathcal{X}$ is

$$G(X) = \log 1/g(X), \tag{II.2}$$

or $\infty$ for $g(X) = 0$. The case $|X| = 1$ gives the length of the web code word for singleton
search terms. The logarithms are throughout base 2.

The web code is a prefix code. The code word associated with $X$ and therefore with the web
event $e(X)$ can be viewed as a compressed version of the set of web pages constituting $e(X)$.
That is, the search engine compresses the set of web pages that contain all elements from $X$
into a code word of length $G(X)$.
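
As an illustration of Definition II.2, the sketch below converts a page count into a web code word length in bits. It is a minimal Python example under the assumption that the normalizing constant N is supplied directly (the counts are those quoted in Example 1.1); the function name `web_code_length` is hypothetical.

```python
import math

def web_code_length(count, total):
    """G(X) = log2(1/g(X)) with g(X) = f(X)/N; infinite when f(X) = 0."""
    if count == 0:
        return math.inf
    return math.log2(total / count)

N = 25_270_000_000  # normalizing count quoted in Example 1.1

g_shakespeare = web_code_length(124_000_000, N)  # singleton "Shakespeare"
g_pair = web_code_length(7_730_000, N)           # {"Shakespeare", "Macbeth"}

# A rarer event gets a longer code word.
assert g_pair > g_shakespeare
print(f"{g_shakespeare:.2f} bits, {g_pair:.2f} bits")
```

An event that occurs on half of the N units costs one bit, and each further halving of the frequency adds one more bit.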

Definition II.3. Let $p \in \{0,1\}^*$ and $X \in \mathcal{X} \setminus S$. The information $EG_{\max}(X)$ to compute
event $e(X)$ from event $e(x)$ for any $x \in X$ is defined by
$EG_{\max}(X) = \min_p\{|p| : \text{for all } x \in X \text{ we have } U(e(x), p) = e(X)\}$.

In this way $EG_{\max}(X)$ corresponds to the length of a single shortest self-delimiting program to
compute output $e(X)$ from an input $e(x)$ for all $x \in X$. We use the notion of prefix Kolmogorov
complexity $K$ as in Appendix C.

Lemma II.4. The function $EG_{\max}$ is upper semicomputable but not computable.

Theorem II.5. $EG_{\max}(X) = \max_{x \in X}\{K(e(X)|e(x))\}$ up to an additive logarithmic term
$O(\log \max_{x \in X}\{K(e(X)|e(x))\})$ which we ignore in the sequel.

To obtain the NWD we must normalize $EG_{\max}$. Let us give some intuition first. Suppose
$X, Y \in \mathcal{X} \setminus S$. If the web events $e(x)$ are more or less the same for all $x \in X$ then we consider
the members of $X$ very similar to each other. If the web events $e(y)$ are very different for
different $y \in Y$ then we consider the members of $Y$ to be very different from one another.
Yet for certain $X$ and $Y$ depending on the cardinalities and the size of the web events of the
members we can have $EG_{\max}(X) = EG_{\max}(Y)$. That is to say, the similarity is dependent on
size. Therefore, to express similarity of the elements in a set $X$ we need to normalize $EG_{\max}(X)$
using the cardinality of $X$ and the events of its members. Expressing the normalized values on
a scale of 0 to 1 allows us to express the degree to which all elements of a set are alike. Then
we can compare truly different sets.


Use the symmetry of information law (A.1) to rewrite $EG_{\max}(X)$ according to Theorem II.5
as $K(e(X)) - \min_{x \in X}\{K(e(x))\}$ up to a logarithmic additive term which we ignore. Since
$G(X)$ is the length of a computable prefix code word for $e(X)$, while $K(e(X))$ is the length of
the shortest computable prefix code word for $e(X)$, it follows that $K(e(X)) \le G(X)$. Similarly
$K(e(x)) \le G(x)$ for $x \in X$. The search engine $G$ returns frequency $f(X)$ on query $X$
(respectively frequency $f(x)$ on query $x$). These frequencies are readily converted into $G(X)$
(respectively $G(x)$) using (II.2). Replace $K(e(X))$ by $G(X)$ and $\min_{x \in X}\{K(e(x))\}$ by
$\min_{x \in X}\{G(x)\}$ in $EG_{\max}(X)$. Subsequently use as normalizing term
$\max_{x \in X}\{G(x)\}(|X| - 1)$. This yields the following.

Definition II.6. The normalized web distance (NWD) of $X \in \mathcal{X}$ with $G(X) < \infty$ (equivalently
$f(X) > 0$) is

$$NWD(X) = \frac{G(X) - \min_{x \in X}\{G(x)\}}{\max_{x \in X}\{G(x)\}\,(|X| - 1)}
= \frac{\max_{x \in X}\{\log f(x)\} - \log f(X)}{(\log N - \min_{x \in X}\{\log f(x)\})(|X| - 1)}, \tag{II.3}$$

otherwise $NWD(X)$ is undefined.

The second equality in (II.3), expressing the NWD in terms of frequencies, is seen as
follows. We use (II.2). The numerator is rewritten by $G(X) = \log 1/g(X) = \log(N/f(X)) =
\log N - \log f(X)$ and $\min_{x \in X}\{G(x)\} = \min_{x \in X}\{\log 1/g(x)\} = \log N - \max_{x \in X}\{\log f(x)\}$.
The denominator is rewritten as $\max_{x \in X}\{G(x)\}(|X| - 1) = \max_{x \in X}\{\log 1/g(x)\}(|X| - 1) =
(\log N - \min_{x \in X}\{\log f(x)\})(|X| - 1)$.

Remark II.7. By assumption $f(X) > 0$ which, since it has integer values, means $f(X) \ge 1$.
The case $f(X) = 0$ means that there is an $x \in X$ such that $e(x) \cap e(X \setminus \{x\}) = \emptyset$. That is,
query $x$ is independent of the set of queries $X \setminus \{x\}$, that is, $x$ has nothing in common with
$X \setminus \{x\}$ since there is no common web page. Hence the NWD is undefined. The other extreme
is that $e(x) = e(y)$ ($x \sim y$) for all $x, y \in X$. In this case $NWD(X) = 0$. ◊


III. Theory 

Let $X = \{x, y\} \in \mathcal{X}$. We can rewrite [5, Section 3.4, formula (6)] for the NGD distance
between $x$ and $y$ as $NWD(X)$ up to a constant. Hence the NGD and NWD coincide for pairs
up to a constant. For arbitrary sets the following holds.

Lemma III.1. Let $X \in \mathcal{X} \setminus S$. Then $NWD(X) \in [0, 1]$.

We determine bounds on how the NWD may change under addition of members to its
argument. These bounds are necessarily loose since the added members may be similar to existing
ones or very different. In Lemma III.2 below we shall distinguish two cases for the relation
between the minimum frequencies of members of $X$ and $Y$ with $X \subseteq Y$ and the overall
frequencies of $X$ and $Y$. In the first case

$$\frac{f(y_1)\, f(X)}{f(x_1)\, f(Y)} \ge \left(\frac{f(x_0)}{f(y_0)}\right)^{(|X| - 1)NWD(X)}, \tag{III.1}$$

where $x_0 = \arg\min_{x \in X}\{\log f(x)\}$, $y_0 = \arg\min_{y \in Y}\{\log f(y)\}$, $x_1 = \arg\max_{x \in X}\{\log f(x)\}$,
and $y_1 = \arg\max_{y \in Y}\{\log f(y)\}$.

We give an example. Let $|X| = 5$, $f(x_0) = 1{,}100{,}000$, $f(y_0) = 1{,}000{,}000$, $f(x_1) = f(y_1) =
2{,}000{,}000$, $f(X) = 500$, $f(Y) = 100$, and $NWD(X) = 0.5$. The righthand side of the inequality
(III.1) is $1.1^2 = 1.21$ while the lefthand side is 5. In the second case inequality (III.1) does
not hold, that is, it holds with the $\ge$ sign replaced by the $<$ sign. We give an example. Let
$|X| = 5$, $f(x_0) = 1{,}100{,}000$, $f(y_0) = 1{,}000{,}000$, $f(x_1) = f(y_1) = 2{,}000{,}000$, $f(X) = 110$,
$f(Y) = 100$, and $NWD(X) = 0.5$. The righthand side of the inequality (III.1) with $\ge$ replaced
by $<$ is $1.1^2 = 1.21$ while the lefthand side is 1.1.

Lemma III.2. Let $X, Z \subseteq Y$, $X, Y, Z \in \mathcal{X} \setminus S$, and $\min_{z \in Z}\{f(z)\} = \min_{y \in Y}\{f(y)\}$.
(i) If $f(y) \ge \min_{x \in X}\{f(x)\}$ for all $y \in Y$ then $(|X| - 1)NWD(X) \le (|Y| - 1)NWD(Y)$.
(ii) Let $f(y) < \min_{x \in X}\{f(x)\}$ for some $y \in Y$. If (III.1) holds then $(|X| - 1)NWD(X) \le
(|Y| - 1)NWD(Y)$. If (III.1) does not hold then $(|X| - 1)NWD(X) > (|Y| - 1)NWD(Y) \ge
(|Z| - 1)NWD(Z)$.
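
The two numerical examples given before Lemma III.2 can be checked mechanically. The sketch below is a minimal Python illustration that evaluates both sides of inequality (III.1), read as f(y1) f(X) / (f(x1) f(Y)) >= (f(x0)/f(y0))^((|X|-1) NWD(X)); the function name `inequality_sides` and its parameter names are chosen here for illustration only.

```python
def inequality_sides(f_x0, f_x1, f_X, f_y0, f_y1, f_Y, size_X, nwd_X):
    """Return (lhs, rhs, holds) for inequality (III.1).

    x0 / x1: least / most frequent member of X; y0 / y1: the same for Y.
    f_X, f_Y: joint frequencies of the sets X and Y (with X a subset of Y).
    """
    lhs = (f_y1 * f_X) / (f_x1 * f_Y)
    rhs = (f_x0 / f_y0) ** ((size_X - 1) * nwd_X)
    return lhs, rhs, lhs >= rhs

# First example: f(X) = 500, so (III.1) holds.
print(inequality_sides(1_100_000, 2_000_000, 500,
                       1_000_000, 2_000_000, 100, size_X=5, nwd_X=0.5))

# Second example: f(X) = 110, so (III.1) fails.
print(inequality_sides(1_100_000, 2_000_000, 110,
                       1_000_000, 2_000_000, 100, size_X=5, nwd_X=0.5))
```

In the first call the left side is 5 and the right side is about 1.21; in the second the left side is 1.1, which falls below 1.21.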

Example III.3. Consider the Shakespeare-Macbeth-Hamlet Example 1.1. Let
X = {Shakespeare, Macbeth}, Y = {Shakespeare, Macbeth, Hamlet}, and
Z = {Shakespeare, Hamlet}. Then inequality (III.1) for $X$ versus $Y$ gives
$(124{,}000{,}000 \times 7{,}730{,}000)/(124{,}000{,}000 \times 663{,}000) \ge (22{,}400{,}000/22{,}400{,}000)^{0.395}$ (that is
$11.659 \ge 1$), and for $Z$ versus $Y$ gives $18{,}500{,}000/663{,}000 \ge (51{,}300{,}000/22{,}400{,}000)^{0.306}$
(that is $27.903 \ge 1.289$). In the first case Lemma III.2 item (i) is applicable since the frequency
minima of $X$ and $Y$ are the same. (In this case inequality (III.1) is not needed.) Therefore
$NWD(X)(|X| - 1)/(|Y| - 1) \le NWD(Y)$ which works out as $0.395/2 \le 0.372$. In the
second case Lemma III.2 item (ii) is applicable since the frequency minima of $Z$ and $Y$ are not
the same. Since inequality (III.1) holds this gives $NWD(Z)(|Z| - 1)/(|Y| - 1) \le NWD(Y)$
which works out as $0.306/2 \le 0.372$. ◊


Remark III.4. To interpret Lemma III.2 we give the following intuition. Under addition of a
member to a set there are two opposing tendencies on the NWD concerned. First, the range of
the NWD stays fixed at a unit and (II.3) shows that addition of a member tends to decrease the
NWD, that is, it moves closer to 0. Second, the common similarity of queries in a given set
as measured by the NWD is based on the number of properties all members of a set have in
common. By adding a member to the set clearly the number of common properties does not
increase and generally decreases. This diminishing tends to cause the NWD to increase, that is,
to move closer to 1. The first effect is visible when $(|X| - 1)NWD(X) > (|Y| - 1)NWD(Y)$, which
happens in the case of Lemma III.2 item (ii) for the case when the frequencies do not satisfy
(A.2). The second effect is visible when $(|X| - 1)NWD(X) \le (|Y| - 1)NWD(Y)$, which
happens in Lemma III.2 item (i), and item (ii) with the frequencies satisfying (A.2). (Note
that to keep $NWD(X) \in [0, 1]$ for all $X$ we have the factor $(|X| - 1)$ in the denominator of
$NWD(X)$. Without this factor the resulting function of $X$ has range $[0, |X| - 1]$ and in the
inequalities in this remark and in the NWD formula (II.3) and all the previous theory properties
the factors $|X| - 1$ and $|Y| - 1$ are replaced by 1.)


For every set X we have that NWD(X) is invariant under permutation of X: it is
symmetric. The NWD is also positive definite as in Appendix D (where equal members should
be interpreted as saying that the set has only one member). However the NWD does not satisfy
the triangle inequality and hence is not a metric. This is natural for a common similarity or
semantics: The members of a set XY can be less similar (have greater NWD) than the similarity
of the members of XZ plus the similarity of the members of ZY for some set Z.

Lemma III.5. The NWD violates the triangle inequality. 

It remains to formally prove that the NWD expresses the similarity of the search terms in
the set. We define the notion of a distance on these sets using the web as side-information. We
consider only distances that are upper semicomputable, that is, the distance can be computably
approximated from above (Appendix B). A priori we allow asymmetric distances, but we exclude
degenerate distances such as $d(X) = 1/2$ for all $X \in \mathcal{X}$ containing a fixed element $x$. That is,
for every $d$ we want only finitely many sets $X \ni x$ such that $d(X) < d$. Exactly how fast we
want the number of sets we admit to go to $\infty$ is not important; it is only a matter of scaling.

Definition III.6. A web distance function (quantifying the common properties or common
features) $d : \mathcal{X} \to \mathcal{R}^+$ is admissible if $d(X)$ is (i) a nonnegative total real function and is
0 iff $X \in S$; (ii) it is upper semicomputable from the $e(x)$’s with $x \in X$ and $e(X)$; and (iii) it
satisfies the density requirement: for every $x \in S$

$$\sum_{X \ni x,\ |X| \ge 2} 2^{-d(X)} \le 1.$$


We give the gist of what we are about to prove. Let $X = \{x_1, x_2, \ldots, x_n\}$. A feature of a query
is a property of the web event of that query. An example is the frequency in the web event of web
pages containing an occurrence of the word “red.” We can compute this frequency for each $e(x_i)$
($1 \le i \le n$). The minimum of those frequencies is the maximum of the number of web pages
containing the word “red” which surely is contained in each web event $e(x_1), \ldots, e(x_n)$. One
can identify this maximum with the inverse of a distance in $X$. There are many such distances
in $X$. The shorter a web distance is, the more dominant is the feature it represents. We show
that the minimum admissible distance is $EG_{\max}(X)$. It is the least admissible web distance
and represents the shortest of all admissible web distances in members of $X$. Hence the closer
the numerator of $NWD(X)$ is to $EG_{\max}(X)$ the better it represents the dominant feature all
members of $X$ have in common.

Theorem III.7. Let $X \in \mathcal{X}$. The function $G(X) - \min_{x \in X}\{G(x)\}$ is a computable upper bound
on $EG_{\max}(X)$. The closer it is to $EG_{\max}(X)$, the better it approximates the shortest admissible
distance in $X$. The normalized form of $EG_{\max}(X)$ is $NWD(X)$.

The normalized least admissible distance in a set is the least admissible distance between its
members which we call the common admissible similarity. Therefore we have:

Corollary III.8. The function $NWD(X)$ is the common admissible similarity among all search
terms in $X$. This admissible similarity can be viewed as semantics that all search terms in $X$
have in common.


IV. Applications 

The application of the approach presented here requires the ability to query a database for the
number of occurrences and co-occurrences of the elements in the set that we wish to analyze.
One challenge is to find a database that has sufficient breadth to contain a meaningful number
of co-occurrences for related terms. As discussed previously, an example of one such database is
the World Wide Web, with the page counts returned by Google search queries used as an estimate
of co-occurrence frequency. There are two issues with using Google search page counts. The first
issue is that Google limits the number of programmatic searches in a single day to a maximum
of 100 queries, and charges for queries in excess of 100 at a rate of up to $50 per thousand. The
second issue with using Google web search page counts is that the numbers are not exact, but
are generated using an approximate algorithm that Google has not disclosed. For the questions
considered previously [?] we found that these approximate measures were sufficient at that time
to generate useful answers, especially in the absence of any a priori domain knowledge. It is
possible to implement internet-based searches without using search engine APIs, so that they are
not subject to the daily limit. This can be accomplished by parsing the HTML returned by the search
engine directly. That the Google page counts are approximate and based
on a non-public algorithm was the greater concern in this study, as changes in the approximation algorithm
can influence page count results in a way that may not reflect true changes to the underlying
distributions. Since any internet search that returns a results count can be used in computing the
NWD, we adopt the approach of using web sites that return exact rather than approximate page
counts for a given query.

Here we describe a comparison of the NWD using the set formulation based on web-site
search result counts with the pairwise NWD formulation. The examples are based on search
results from Amazon, Wikipedia and the National Center for Biotechnology Information (NCBI)
website from the U.S. National Institutes of Health. The NCBI website exposes all of the NIH
databases searchable from a single web portal. We consider example classification questions
that involve partitioning a set of words into underlying categories. For the NCBI applications
we compare various diseases using the loci identified by large genome wide association studies
(GWAS). For the NWD set classification, we determine whether to assign element x to class A
or class B by computing NWD(Ax) - NWD(A) and NWD(Bx) - NWD(B), where Ax denotes
the class A with x added, and assigning element x to whichever class achieves the minimum.
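
The set-based classification rule can be sketched as follows. This is a minimal Python illustration and not the MATLAB implementation used for the experiments: the callable `count`, the toy list `pages`, and the constant `TOTAL` are all hypothetical stand-ins for a search engine and its normalizing count N, and a class whose own joint count is zero is assumed not to occur.

```python
import math

def nwd(terms, count, total):
    """NWD of a list of terms; count(terms) is the number of pages with all of them."""
    joint = count(terms)
    if joint == 0:
        return math.inf  # undefined in the text; treated here as "nothing in common"
    logs = [math.log2(count([t])) for t in terms]
    return (max(logs) - math.log2(joint)) / (
        (math.log2(total) - min(logs)) * (len(terms) - 1))

def classify(x, classes, count, total):
    """Assign x to the class A that minimizes NWD(Ax) - NWD(A)."""
    def increase(members):
        return nwd(members + [x], count, total) - nwd(members, count, total)
    return min(classes, key=lambda name: increase(classes[name]))

# Toy corpus standing in for a search engine (made-up data).
pages = [
    {"red", "blue", "green"}, {"red", "blue", "green", "yellow"},
    {"red", "blue", "yellow"}, {"lion", "tiger", "bear"},
    {"lion", "tiger", "bear", "zebra"}, {"lion", "tiger", "zebra"},
    {"red", "lion"}, {"red", "blue", "zebra"},
]
TOTAL = 100  # assumed normalizing constant N, larger than any count

def count(terms):
    return sum(1 for page in pages if all(t in page for t in terms))

classes = {"colors": ["red", "blue"], "animals": ["lion", "tiger"]}
print(classify("zebra", classes, count, TOTAL))   # animals
print(classify("yellow", classes, count, TOTAL))  # colors
```

Each candidate class costs one joint query for the enlarged set plus singleton queries, which is why no pairwise distance matrix has to be built for this rule.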

For the pairwise formulation, we use the gap spectral clustering unsupervised approach
developed in [?]. Gap spectral clustering uses the gap statistic as first proposed in [17] to
estimate the number of clusters in a data set from an arbitrary clustering algorithm. In [?], it
was shown that the gap statistic in conjunction with a spectral clustering [15] of the distance
matrix obtained from pairwise NWD measurements is an estimate of randomness deficiency for
clustering models. Randomness deficiency is a measure of the meaningful information that a
model, here a clustering algorithm, captures in a particular set of data [12]. The approach is to
select the number of clusters that minimizes the randomness deficiency as approximated by the
gap value. In practice, this is achieved by picking the first number of clusters where the gap
value achieves a maximum as described in [?].

The gap value is computed by comparing the intra-cluster dispersions of the pairwise NWD
distance matrix to that of uniformly distributed randomly generated data on the same range. For
each value of $k$, the number of clusters in the data, we apply a spectral clustering algorithm to
partition the data, assigning each element in the data to one of $k$ clusters. Next, we compute
$D_r$, the sum of the distances between elements in each cluster $C_r$,

$$D_r = \sum_{i,j \in C_r} d_{ij}.$$

The average intra-cluster dispersion is calculated,

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,$$

where $n_r$ is the number of points in cluster $C_r$. The gap statistic is then computed as the
difference between the averages of the intra-cluster distances of our data and the intra-cluster
distances of $B$ randomly generated uniformly distributed data sets of the same dimension as our
data,

$$Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{kb}) - \log(W_k),$$

where $W_{kb}$ is the average intra-cluster dispersion obtained by running our clustering algorithm
on each of the $B$ randomly generated uniformly-distributed datasets. Following [4] we set $B$ to
100. We compute the standard deviation of the gap value from $\sigma_k$, the standard deviation of
the $B$ uniformly distributed randomly generated data, adjusted to account for simulation error,
as

$$s_k = \sigma_k \sqrt{1 + 1/B}.$$

Finally, we choose the smallest value of $k$ for which

$$Gap(k) \ge Gap(k + 1) - s_{k+1}.$$
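
The selection procedure can be sketched in a few functions. The following minimal Python example assumes the standard form of the gap statistic, with the dispersion of a cluster taken as the sum of its pairwise distances divided by twice its size and the spread taken as the population standard deviation of the log dispersions of the reference sets. It omits the spectral clustering step and the generation of reference data, and all function names are illustrative.

```python
import math

def dispersion(dist, labels):
    """W_k: sum over clusters r of D_r / (2 n_r), with D_r summed over ordered pairs."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    total = 0.0
    for members in clusters.values():
        d_r = sum(dist[i][j] for i in members for j in members)
        total += d_r / (2 * len(members))
    return total

def gap_and_error(w_data, w_reference):
    """Return (Gap(k), s_k) from W_k of the data and the W_kb of B reference sets."""
    b = len(w_reference)
    logs = [math.log(w) for w in w_reference]
    mean = sum(logs) / b
    sigma = math.sqrt(sum((v - mean) ** 2 for v in logs) / b)
    return mean - math.log(w_data), sigma * math.sqrt(1 + 1 / b)

def choose_k(gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; both dicts are keyed by k."""
    ks = sorted(gaps)
    for k in ks[:-1]:
        if gaps[k] >= gaps[k + 1] - errors[k + 1]:
            return k
    return ks[-1]

# Four points forming two tight pairs (hypothetical distances).
dist = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
print(dispersion(dist, [0, 0, 0, 0]))  # one cluster: 4.5
print(dispersion(dist, [0, 0, 1, 1]))  # two clusters: 1.0
print(choose_k({1: 0.1, 2: 0.9, 3: 0.85}, {1: 0.05, 2: 0.05, 3: 0.1}))  # 2
```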


We now describe results from a number of sample applications. For all of these applications,
we use a single implementation based on co-occurrence counts. For each search engine that we
used, including Amazon, Wikipedia and NCBI, a custom MATLAB script was developed to parse
the search count results. We used the page counts returned using the built-in search from each
website for the frequencies, and following the approach in [?] chose N as the frequency for the
search term ‘the’. The results described were not sensitive to the choice of search term used to
establish N; for example, identical classification results were obtained using the counts returned
by the search term ‘N’ as the normalizing factor. Following each classification result below, we
include in parentheses the 95% confidence interval for the result, computed as described in [19].

The first three classification questions we considered used the Wikipedia search engine. These
questions include classifying colors vs. animals, classifying colors vs. shapes and classifying
presidential candidates by political party for the 2008 U.S. presidential election. For colors
vs. animals and colors vs. shapes, gap spectral clustering found two groups in the data and classified
all of the elements 100% correctly. The NWD set formulation classified the terms perfectly
(0.82,1.0). For the presidential candidate classification by party, the pairwise NWD formulation
performed poorly, classifying 58% correctly (0.32,0.8), while the multiset formulation obtained
100% correct classification (0.76,1.0). Table I shows the data used for each question, together
with the pairwise and set accuracy, the number of groups obtained by gap spectral clustering
and the total number of website queries required for each method.


search engine: Wikipedia

Classes                                       Multisets  Pairwise  Groups found     Number of   Number of
                                              Correct    Correct   by gap spectral  queries     queries
                                                                                    (pairwise)  (multisets)

{red, orange, yellow, green, blue, indigo}    100%       100%      2                136         394
{lion, tiger, bear, monkey, zebra,
elephant, aardvark, lamb, fox, ape, dog}

{red, orange, yellow, green, blue, indigo,    100%       100%      2                105         342
violet, purple, cyan, white}
{square, circle, rectangle, ellipse,
triangle, rhombus}

{Barack Obama, Hillary Clinton, John          100%       58%       2                66          198
Edwards, Joe Biden, Chris Dodd, Mike
Gravel}
{John McCain, Mitt Romney, Mike
Huckabee, Ron Paul, Fred Thompson,
Alan Keyes}


TABLE I

Classification results using Wikipedia.


The next classification question considered used page counts returned by the Amazon website
search engine to classify book titles by author. Table II summarizes the sets of novels associated
with each author, and the classification results for each author as a confusion matrix. The multiset
NWD (top) misclassified one of the Tolstoy novels (‘War and Peace’) to Stephen King, but
classified all other novels correctly, 96% accurate (0.83,0.99). The pairwise NWD
performed significantly more poorly, achieving only 79% accuracy (0.6,0.9).

Shakespeare = {Macbeth, The Tempest, Othello, King Lear, Hamlet, The Merchant of Venice, A Midsummer Nights
Dream, Much Ado About Nothing, Taming of the Shrew, Twelfth Night}

King = {Carrie, Salems Lot, The Shining, The Stand, The Dead Zone, Firestarter, Cujo}

Twain = {Adventures of Huckleberry Finn, A Connecticut Yankee in King Arthurs Court, Life on the Mississippi,
Puddnhead Wilson}

Hemingway = {The Old Man and The Sea, The Sun Also Rises, For Whom the Bell Tolls, A Farewell To Arms}

Tolstoy = {Anna Karenina, War and Peace, The Death of Ivan Ilyich}


Multiset NWD                          True Class
Predicted Class    Shakespeare  King  Twain  Hemingway  Tolstoy

Shakespeare        10           0     0      0          0
King               0            7     0      0          1
Twain              0            0     4      0          0
Hemingway          0            0     0      4          0
Tolstoy            0            0     0      0          2

Correct: 96%


Pairwise NWD                          True Class
Predicted Class    Shakespeare  King  Twain  Hemingway  Tolstoy

Shakespeare        10           0     0      1          1
King               0            6     0      0          0
Twain              0            0     4      0          0
Hemingway          0            1     0      3          3
Tolstoy            0            0     0      0          0

Correct: 79%


TABLE II

Classifying novels by author using Amazon
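
The accuracy reported for the multiset NWD follows directly from its confusion matrix in Table II: the correctly classified novels are the diagonal entries. The minimal Python sketch below recomputes it; the matrix is copied from the table (rows are predicted classes, columns are true classes) and the function name `accuracy` is illustrative.

```python
def accuracy(matrix):
    """Fraction of items on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Multiset NWD confusion matrix from Table II.
# Order: Shakespeare, King, Twain, Hemingway, Tolstoy.
multiset = [
    [10, 0, 0, 0, 0],
    [0, 7, 0, 0, 1],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 4, 0],
    [0, 0, 0, 0, 2],
]

print(f"{accuracy(multiset):.0%}")  # 96% (27 of 28 novels)
```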

The final application considered is to quantify similarities among diseases based on the results
of genome wide association studies (GWAS). These studies scan the genomes from a large
population of individuals to identify genetic variations occurring at fixed locations, or loci, that
can be associated with the given disease. Here we use the NIH NCBI database to search for
similarities among diseases, comparing loci identified by recent GWAS results for each disease.
The diseases included Alzheimers [22], Parkinsons [27], Amyotrophic lateral sclerosis (ALS)
[?], Schizophrenia [28], Leukemia [24], Obesity [26], and Neuroblastoma [?]. The top of
Table III lists the loci used for each disease. The middle panel of Table III shows at each
location (i,j) of the distance matrix the NWD computed for the combined counts for the loci of
disease i concatenated with disease j. The diagonal elements (i,i) show the NWD for the loci
of disease i. The bottom panel of Table III shows the NWD for each element with the diagonal
subtracted, (i,j) - (i,i). This is equivalent to the NWD(Ax) - NWD(A) value used in the
previous classification problems. The two minimum values in the bottom panel, showing the
relationships between Parkinsons and Obesity, as well as between Schizophrenia and Leukemia,
were surprising. The hypothesis was that neurological disorders such as Parkinsons, ALS and
Alzheimers would be more similar to each other. After obtaining these results we found that there
have been recent findings of strong relationships both between Schizophrenia and Leukemia [29]
and between Parkinsons and Obesity [30], relationships that have also been identified by
clinical evidence not relating to GWAS approaches.

Schizophrenia = {'rs1702294', 'rs11191419', 'rs2007044', 'rs4129585', 'rs35518360'}

Leukemia = {'rs17483466', 'rs13397985', 'rs757978', 'rs2456449', 'rs735665', 'rs783540', 'rs305061', 'rs391525',
'rs1036935', 'rs11083846'}

Alzheimers = {'rs4420638', 'rs7561528', 'rs17817600', 'rs3748140', 'rs12808148', 'rs6856768', 'rs11738335',
'rs1357692'}

Obesity = {'rs10926984', 'rs12145833', 'rs2783963', 'rs11127485', 'rs17150703', 'rs13278851'}

Neuroblastoma = {'rs6939340', 'rs4712653', 'rs9295536', 'rs3790171', 'rs7272481'}

Parkinsons = {'rs356219', 'rs10847864', 'rs2942168', 'rs11724635'}

ALS = {'rs2303565', 'rs1344642', 'rs2814707', 'rs3849942', 'rs2453556', 'rs1971791', 'rs8056742'}


NWD(i,j)

                Alzheimers   Parkinsons   ALS        Schizophrenia   Leukemia   Obesity    Neuroblastoma
Alzheimers      1.29E-02     2.43E-02     1.38E-02   1.55E-02        1.23E-02   1.49E-02   1.61E-02
Parkinsons      2.43E-02     1.80E-02     1.83E-02   1.58E-02        1.68E-02   1.53E-02   2.23E-02
ALS             1.38E-02     1.83E-02     9.76E-03   1.19E-02        1.46E-02   9.96E-03   1.75E-02
Schizophrenia   1.55E-02     1.58E-02     1.19E-02   1.38E-02        1.13E-02   1.60E-02   1.93E-02
Leukemia        1.23E-02     1.68E-02     1.46E-02   1.13E-02        7.54E-03   1.15E-02   1.61E-02
Obesity         1.49E-02     1.53E-02     9.96E-03   1.60E-02        1.15E-02   1.23E-02   1.51E-02
Neuroblastoma   1.61E-02     2.23E-02     1.75E-02   1.93E-02        1.61E-02   1.51E-02   1.51E-02


NWD(i,j)-NWD(i,i)

                Alzheimers   Parkinsons   ALS        Schizophrenia   Leukemia   Obesity    Neuroblastoma
Alzheimers      0            1.14E-02     9.20E-04   2.64E-03        -6.08E-04  1.98E-03   3.22E-03
Parkinsons      6.26E-03     0            2.77E-04   -2.28E-03       -1.28E-03  -2.76E-03  4.26E-03
ALS             4.04E-03     8.57E-03     0          2.11E-03        4.87E-03   2.00E-04   7.75E-03
Schizophrenia   1.75E-03     2.01E-03     -1.90E-03  0               -2.44E-03  2.20E-03   5.56E-03
Leukemia        4.73E-03     9.23E-03     7.09E-03   3.78E-03        0          3.99E-03   8.53E-03
Obesity         2.57E-03     3.01E-03     -2.33E-03  3.69E-03        -7.58E-04  0          2.78E-03
Neuroblastoma   1.01E-03     7.23E-03     2.43E-03   4.25E-03        9.92E-04   -1.04E-05  0


TABLE III 

GWAS LOCI USED AS INPUT TO NWD FOR QUANTIFYING DISEASE SIMILARITY USING THE NIH NCBI WEBSITE. 
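
The diagonal subtraction that produces the bottom panel of Table III can be sketched in a few lines of Python. The sketch uses only a four-disease sub-block of the middle panel, with values rounded as printed, so its differences agree with the bottom panel only up to rounding. The helper names `subtract_diagonal` and `closest_other` are ours.

```python
diseases = ["Parkinsons", "Schizophrenia", "Leukemia", "Obesity"]

# Sub-block of the middle panel of Table III (rounded values as printed).
nwd = [
    [1.80e-02, 1.58e-02, 1.68e-02, 1.53e-02],
    [1.58e-02, 1.38e-02, 1.13e-02, 1.60e-02],
    [1.68e-02, 1.13e-02, 7.54e-03, 1.15e-02],
    [1.53e-02, 1.60e-02, 1.15e-02, 1.23e-02],
]

def subtract_diagonal(matrix):
    """Return the matrix with entry (i, j) replaced by (i, j) - (i, i)."""
    return [[value - row[i] for value in row] for i, row in enumerate(matrix)]

def closest_other(matrix, names, i):
    """Name of the disease j != i with the smallest diagonal-subtracted value in row i."""
    delta = subtract_diagonal(matrix)[i]
    j = min((j for j in range(len(names)) if j != i), key=lambda j: delta[j])
    return names[j]

for i, name in enumerate(diseases):
    print(name, "->", closest_other(nwd, diseases, i))
```

On this sub-block the Parkinsons row is minimized at Obesity and the Schizophrenia row at Leukemia, matching the two relationships singled out in the text.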


V. Conclusion 

Consider queries to a search engine using a data base divided into chunks called web pages. On
each query the search engine returns a set of web pages. We propose a method, the normalized
web distance (NWD) for sets of queries, that quantifies in a single number between 0 and 1 the
way in which the queries in the set are similar: 0 means all queries in the set are the same (the
set has cardinality one) and 1 means all queries in the set are maximally dissimilar to each other.
The similarity among queries uses the frequency counts of web pages returned for each query
and for the set of queries. The method can be applied using any big data base and a search engine
that returns reliable aggregate page counts. Since this method uses names for objects, and not the
objects themselves, we can view the common similarity of the names as a common semantics
between those names (words or phrases). The common similarity between a finite nonempty set
of queries can be viewed as a distance or diameter of this set. We show that this distance ranges
between 0 and 1, how it changes under adding members to the set, that it does not satisfy
the triangle property, and that the NWD formally and provably expresses common similarity
(common semantics).
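
As an illustration of how the frequency counts enter, the following Python sketch evaluates the NWD in the form used in (A.2) of the Appendix: the numerator is max log f(x) - log f(X) and the denominator is (log N - min log f(x))(|X| - 1). The function name `nwd`, its argument names, and the page counts in the usage lines are hypothetical; no real search engine is queried.

```python
import math

def nwd(counts, joint_count, total_pages):
    """NWD of a query set X from aggregate page counts.

    counts      -- f(x) for each single query x in X
    joint_count -- f(X), the count for all queries of X together
    total_pages -- N, the normalizing total (hypothetical in the usage below)
    """
    if len(counts) < 2:
        return 0.0  # a set of cardinality one: all queries are the same
    if joint_count <= 0 or min(counts) <= 0:
        raise ValueError("NWD is undefined for zero page counts")
    logs = [math.log2(c) for c in counts]
    numerator = max(logs) - math.log2(joint_count)
    denominator = (math.log2(total_pages) - min(logs)) * (len(counts) - 1)
    return numerator / denominator

# Hypothetical counts, chosen only to exercise the formula.
print(nwd([1000, 1000], 1000, 10**9))  # queries that always co-occur
print(nwd([1000, 2000], 500, 10**9))   # queries that partly co-occur
```

The base of the logarithm cancels in the quotient, so base 2 is used only for convenience.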

To test the efficacy of the new method for classification we experimented with small data
sets of queries based on search results from Wikipedia, Amazon, and the National Center
for Biotechnology Information (NCBI) website of the U.S. National Institutes of Health. In
particular we compared classification using pairwise NWDs with classification using set NWD.
The latter consistently performed as well or better, sometimes much better.

Appendix 

A. Strings and the Self-Delimiting Property 

We write string to mean a finite binary string, and ε denotes the empty string. (If the string is
over a larger finite alphabet we recode it into binary.) The length of a string x (the number of bits
in it) is denoted by |x|. Thus, |ε| = 0. The self-delimiting code for x of length n is $\bar{x} = 1^{|x|}0x$
of length 2n + 1, or the even shorter $x' = 1^{||x||}0|x|x$ of length n + 2 log n + 1 (see [12] for still shorter
self-delimiting codes). Self-delimiting code words encode where they end. The advantage is that
if many strings of varying lengths are encoded self-delimitingly using the same code, then their
concatenation can be parsed into their constituent code words in one pass going from left to right.
Self-delimiting codes are computable prefix codes. A prefix code has the property that no code
word is a proper prefix of any other code word. The code-word set is called prefix-free.
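
The one-pass parsing property can be illustrated with a short Python sketch. It implements the code that prefixes x with its length in unary followed by a 0, and a shorter code that prefixes x with a self-delimited length field. As an assumption of this sketch, the length field is written in ordinary binary, so the second code has length n + 2 log n plus a small constant rather than exactly the figure quoted above. All function names are ours.

```python
def bar(x: str) -> str:
    """Code x as |x| ones, a zero, then x itself: length 2|x| + 1."""
    return "1" * len(x) + "0" + x

def prime(x: str) -> str:
    """Shorter code: the length of x (ordinary binary, an assumption of
    this sketch), itself bar-coded, followed by x."""
    return bar(format(len(x), "b")) + x

def parse_bar(stream: str, pos: int = 0):
    """Read one bar-coded word starting at pos; return (word, next position)."""
    n = 0
    while stream[pos + n] == "1":
        n += 1
    start = pos + n + 1  # skip the unary length and the separating zero
    return stream[start:start + n], start + n

def parse_prime(stream: str, pos: int = 0):
    """Read one prime-coded word starting at pos; return (word, next position)."""
    length_bits, pos = parse_bar(stream, pos)
    n = int(length_bits, 2)
    return stream[pos:pos + n], pos + n

def parse_all(stream: str):
    """Split a concatenation of prime-coded words in one left-to-right pass."""
    words, pos = [], 0
    while pos < len(stream):
        word, pos = parse_prime(stream, pos)
        words.append(word)
    return words

words = ["", "0", "1101", "00"]
encoded = "".join(prime(w) for w in words)
print(parse_all(encoded) == words)
```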

We identify strings with natural numbers by associating each string with its
index in the length-increasing lexicographic ordering according to the scheme
(ε, 0), (0, 1), (1, 2), (00, 3), (01, 4), (10, 5), (11, 6), .... In this way the Kolmogorov complexity
can be about finite binary strings or natural numbers.
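
This correspondence has a simple closed form: the string with index n is the binary expansion of n + 1 with its leading 1 removed. A small Python sketch (function names are ours):

```python
def number_to_string(n: int) -> str:
    """String with index n in the length-increasing lexicographic ordering."""
    return format(n + 1, "b")[1:]

def string_to_number(s: str) -> int:
    """Index of the binary string s in the same ordering."""
    return int("1" + s, 2) - 1

print([number_to_string(n) for n in range(7)])
```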
B. Computability Notions

A pair of integers such as (p, q) can be interpreted as the rational p/q. We assume the notion
of a function with rational arguments and values. A function f(x) with x rational is upper
semicomputable if it is defined by a rational-valued total computable function $\phi(x, k)$ with x a
rational number and k a nonnegative integer such that $\phi(x, k+1) \le \phi(x, k)$ for every k and
$\lim_{k \to \infty} \phi(x, k) = f(x)$. This means that f can be computed from above (see [12, p. 35]). A
function f is lower semicomputable if -f is upper semicomputable. If a function is both
upper semicomputable and lower semicomputable then it is computable.

C. Kolmogorov Complexity 

The Kolmogorov complexity is the information in a single finite object. Informally, the
Kolmogorov complexity of a finite binary string is the length of the shortest string from which
the original can be losslessly reconstructed by an effective general-purpose computer such as a
particular universal Turing machine. Hence it constitutes a lower bound on how far a lossless
compression program can compress. For technical reasons we choose Turing machines with a
separate read-only input tape that is scanned from left to right without backing up, a separate
work tape on which the computation takes place, an auxiliary tape inscribed with the auxiliary
information, and a separate output tape. All tapes are divided into squares and are semi-infinite.
Initially, the input tape contains a semi-infinite binary string with one bit per square starting
at the leftmost square, and all heads scan the leftmost squares on their tapes. Upon halting,
the initial segment p of the input that has been scanned is called the input program and the
contents of the output tape are called the output. By construction, the set of halting programs is
prefix-free (Appendix A), and this type of Turing machine is called a prefix Turing machine. A
standard enumeration of prefix Turing machines $T_1, T_2, \ldots$ contains a universal machine U such
that $U(i, p, y) = T_i(p, y)$ for all indexes i, programs p, and auxiliary strings y. (Such universal
machines are called "optimal" in contrast with universal machines like U' with $U'(i, pp, y) = T_i(p, y)$
for all i, p, y, and $U'(i, q, y) = 1$ for $q \ne pp$ for some p.) We call U the reference
universal prefix Turing machine. This leads to the definition of prefix Kolmogorov complexity.

Formally, the conditional prefix Kolmogorov complexity K(x|y) is the length of the shortest
input z such that the reference universal prefix Turing machine U on input z with auxiliary
information y outputs x. The unconditional Kolmogorov complexity K(x) is defined by K(x|ε)
where ε is the empty string. In these definitions both x and y can consist of strings into which
finite sets of finite binary strings are encoded. Theory and applications are given in the textbook
[12].

For a finite set of strings we assume that the strings are ordered length-increasing lexicographically.
This allows us to assign a unique Kolmogorov complexity to a set. The conditional prefix
Kolmogorov complexity K(X|x) of a set X given an element x is the length of a shortest
program p for the reference universal Turing machine that with input x outputs the set X.
The prefix Kolmogorov complexity K(X) of a set X is defined by K(X|ε). One can also
put a set in the conditional, such as K(x|X) or K(X|Y). We will use the straightforward laws
K(·|X, x) = K(·|X) and K(X|x) = K(X'|x) up to an additive constant term, for x ∈ X and
X' equal to the set X with the element x deleted.

We use the following notions from the theory of Kolmogorov complexity. The symmetry of
information property [6] for strings x, y is

K(x, y) = K(x) + K(y|x) = K(y) + K(x|y),     (A.1)

with equalities up to an additive term O(log K(x, y)).

D. Metricity 

A distance function d on $\mathcal{X}$ is defined by $d : \mathcal{X} \to \mathcal{R}^+$ where $\mathcal{R}^+$ is the set of nonnegative
real numbers. If $X, Y, Z \in \mathcal{X}$, then Z = XY if Z is the set consisting of the elements of the
sets X and Y ordered length-increasing lexicographically. A distance function d is a metric if it satisfies

1) Positive definiteness: d(X) = 0 if all elements of X are equal and d(X) > 0 otherwise.
(For sets equality of all members means |X| = 1.)

2) Symmetry: d(X) is invariant under all permutations of X.

3) Triangle inequality: d(XY) ≤ d(XZ) + d(ZY).


E. Proofs 


Proof of Lemma II.4:


We can run all programs in dovetailed fashion and at each time instant select a shortest
program that with inputs e(x) for all x ∈ X has terminated with the same output e(X).
The lengths of these shortest programs get shorter and shorter, and with growing time
eventually reach $EG_{\max}(X)$ (but we do not know the time at which this happens). Therefore
$EG_{\max}(X)$ is upper semicomputable. It is not computable since for X = {x, y} we have
$EG_{\max}(X) = \max\{K(e(x)|e(y)), K(e(y)|e(x))\} + O(1)$, the information distance between e(x)
and e(y), which is known to be incomputable [1]. ■
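
The approximation from above can be illustrated with a toy Python sketch. It does not simulate a universal prefix machine: the table `TOY_PROGRAMS`, with its halting steps and outputs, is entirely hypothetical and stands in for the programs being dovetailed. The sketch shows only the shape of the argument: the best bound found so far never increases, and at no finite stage can one tell whether it will drop again.

```python
# Toy stand-in for a machine: program -> (halting step, output).
# A halting step of None means the program never halts. Entirely hypothetical.
TOY_PROGRAMS = {
    "0": (None, None),
    "10": (50, "abc"),
    "110": (3, "xyz"),
    "1110": (5, "abc"),
    "11110": (1, "abc"),
}

def upper_bounds(target, stages):
    """Yield, for stage t = 1..stages, the length of the shortest program
    seen so far that has halted within t steps with output `target`
    (None while no such program has been found)."""
    best = None
    for t in range(1, stages + 1):
        for program, (halt_step, output) in TOY_PROGRAMS.items():
            halted = halt_step is not None and halt_step <= t
            if halted and output == target:
                if best is None or len(program) < best:
                    best = len(program)
        yield best

bounds = list(upper_bounds("abc", 60))
print(bounds[0], bounds[4], bounds[-1])
```

An observer who stops at stage 49 sees the bound 4 and cannot know that a shorter program halts one stage later.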


Proof of Theorem II.5:


(≤) We use a modification of the proof of [10, Theorem 2]. According to Definition II.1,
x = y iff e(x) = e(y). Let $X = \{x_1, \ldots, x_n\}$ and $k = \max_{x \in X}\{K(e(X)|e(x))\}$. A set of cardinality n
in S is for the purposes of this proof represented by an n-vector of which the entries consist
of the lexicographic length-increasing sorted members of the set. For each 1 ≤ i ≤ n let $\mathcal{Y}_i$
be the set of computably enumerated n-vectors $Y = (y_1, \ldots, y_n)$ with entries in S such that
$K(e(Y)|e(y_i)) \le k$ for each 1 ≤ i ≤ n. Define the set $V = \bigcup_{i=1}^{n} \mathcal{Y}_i$. This V is the set of vertices
of a graph G = (V, E). The set of edges E is defined by: two vertices $u = (u_1, \ldots, u_n)$ and
$v = (v_1, \ldots, v_n)$ are connected by an edge iff there is 1 ≤ j ≤ n such that $u_j = v_j$. There are
at most $2^k$ self-delimiting programs of length at most k computing from input $e(u_j)$ to different
e(v)'s with $u_j$ in vertex v as jth entry. Hence there can be at most $2^k$ vertices v with $u_j$ as jth
entry. Therefore, for every u ∈ V and 1 ≤ j ≤ n there are at most $2^k$ vertices v ∈ V such
that $v_j = u_j$. The vertex-degree of graph G is therefore bounded by $n2^k$. Each graph can be
vertex-colored by a number of colors at most one more than the maximal vertex-degree; here
$n2^k$ colors suffice, because the at most $2^k$ vertices sharing the jth entry with u include u itself.
This divides the set of vertices V into disjoint color classes $V = V_1 \cup \cdots \cup V_D$ with $D \le n2^k$.
To compute e(X) from e(x) with x ∈ X we only need the color class of which e(X) is a member
and the position of x in the n-vector X. Namely, by construction every two vertices with the same
element in the jth position are connected by an edge. Therefore there is at most a single vertex
with x in the jth position in a color class. Let x be the jth entry of n-vector X. It suffices to have
a program of length at most $\log(n2^k) + O(\log nk) = k + O(\log nk)$ bits to compute e(X) from e(x).
From n and k we can generate G, and given $\log(n2^k)$ bits we can identify the color class $V_d$ of e(X).
Using another log n bits we define the position of x in the n-vector X. To make such a program
self-delimiting add a logarithmic term. In total $k + O(\log k)$ suffices since $O(\log k) = O(\log n + \log nk)$.

(≥) That $EG_{\max}(X) \ge \max_{x \in X}\{K(e(X)|e(x))\}$ follows trivially from the definitions. ■
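
The coloring step of this proof can be tried out on a small example. The Python sketch below uses a hypothetical vertex set, namely all sorted 3-vectors over a five-letter alphabet, rather than the computably enumerated set V of the proof. It colors the graph greedily and then checks that a color class together with a position j and the entry at that position determines the vertex uniquely. All names are ours.

```python
from itertools import combinations

def adjacent(u, v):
    """Two distinct vectors are joined by an edge iff they agree in some position."""
    return u != v and any(a == b for a, b in zip(u, v))

def greedy_coloring(vertices):
    """Give each vertex the smallest color not used by an already colored neighbor."""
    color = {}
    for v in vertices:
        used = {c for u, c in color.items() if adjacent(u, v)}
        color[v] = next(c for c in range(len(vertices)) if c not in used)
    return color

def build_lookup(color):
    """Map (color class, position j, entry) to the unique vertex it determines."""
    lookup = {}
    for v, c in color.items():
        for j, entry in enumerate(v):
            key = (c, j, entry)
            if key in lookup:
                raise ValueError("coloring is not proper")
            lookup[key] = v
    return lookup

# Toy vertex set: all sorted 3-vectors over a hypothetical five-letter alphabet.
vertices = list(combinations("abcde", 3))
color = greedy_coloring(vertices)
lookup = build_lookup(color)

# Recover a vertex from its color class, a position, and the entry there.
v = vertices[4]
print(lookup[(color[v], 1, v[1])] == v)
```

Greedy coloring uses at most one color more than the maximal vertex-degree, which is the bound the counting argument needs.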
Proof of Lemma III.1:


(≥ 0) Since f(X) ≤ f(x) for all x ∈ X, the numerator of the right-hand side of (II.3) is


February 23, 2015 


DRAFT 










nonnegative. Since the denominator is also nonnegative we have NWD(X) ≥ 0. Example of
the lower bound: if $\max_{x \in X}\{\log f(x)\} = \log f(X)$, then NWD(X) = 0.

(≤ 1) Intuitively the upper bound on g(X) is reached if the web events e(x) for x ∈ X are
mutually almost disjoint. We say "almost" since if $\bigcap_{x \in X} e(x) = \emptyset$ then NWD(X) is undefined.

Case 1: Let the web events e(x) satisfy $|\bigcap_{x \in X} e(x)| = 1$. Then $g(X) = \prod_{x \in X}(g(x) - 1/N) + 1/N$.
By (II.2) therefore $\sum_{x \in X} G(x) - G(X) = \epsilon$ for some very small positive $\epsilon$.

Subcase 1.a: Let |e(x)| = |e(y)| for all x, y ∈ X. Then
$G(X) - \min_{x \in X}\{G(x)\} = (|X| - 1)\max_{x \in X}\{G(x)\} - \epsilon$. By (II.3) we have
$NWD(X) = 1 - \epsilon'$ where $\epsilon' = \epsilon/((|X| - 1)\max_{x \in X}\{G(x)\})$.

Subcase 1.b: Let |e(x)| ≠ |e(y)| for some x, y ∈ X. Then
$G(X) - \min_{x \in X}\{G(x)\} < (|X| - 1)\max_{x \in X}\{G(x)\}$. By (II.3) we have $NWD(X) < 1 - \epsilon'$.

Case 2: Let the web events e(x) satisfy $|\bigcap_{x \in X} e(x)| > 1$. Then $g(X) > \prod_{x \in X}(g(x) - 1/N) + 1/N$,
yielding $\sum_{x \in X} G(x) - G(X) < \epsilon$ and therefore
$G(X) - \min_{x \in X}\{G(x)\} < (|X| - 1)\max_{x \in X}\{G(x)\}$. By (II.3) we have $NWD(X) < 1 - \epsilon'$.


Proof of Lemma III.2:


(i) Since X ⊆ Y and because of the condition of item (i) we have
$\min_{y \in Y}\{\log f(y)\} = \min_{x \in X}\{\log f(x)\}$. From X ⊆ Y it also follows that
$\max_{y \in Y}\{\log f(y)\} \ge \max_{x \in X}\{\log f(x)\}$ and
$\log f(X) \ge \log f(Y)$. Therefore the numerator of NWD(Y) is at least as great as that of
NWD(X), and the denominator of NWD(Y) equals (|Y| - 1)/(|X| - 1) times the denominator
of NWD(X).

(ii) We have $\min_{y \in Y}\{\log f(y)\} < \min_{x \in X}\{\log f(x)\}$. If NWD(X) = 1 then NWD(Y) = 1
(in both cases there is no common similarity of the members of the set). Item (ii) follows
vacuously in this case. Therefore assume that NWD(X) < 1. Write NWD(X) = a/b with a
equal to the numerator of NWD(X) and b equal to the denominator. If c, d are real numbers
satisfying c/d ≥ a/b then bc ≥ ad. Therefore ab + bc ≥ ab + ad, which rearranged yields
(a + c)/(b + d) ≥ a/b. If c/d < a/b then by similar reasoning (a + c)/(b + d) < a/b.
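
The a, b, c, d step is the familiar fact that the mediant (a + c)/(b + d) lies on the same side of a/b as c/d does. It relies on b and d being positive, an assumption made explicit in this sketch. The following Python check with exact rational arithmetic confirms both directions on a small grid of integers; the function names are ours.

```python
from fractions import Fraction
from itertools import product

def mediant(a, b, c, d):
    """The fraction (a + c)/(b + d)."""
    return Fraction(a + c, b + d)

def check_mediant(limit=6):
    """Check both directions for all 0 <= a, c < limit and 1 <= b, d < limit."""
    for a, c in product(range(limit), repeat=2):
        for b, d in product(range(1, limit), repeat=2):
            m = mediant(a, b, c, d)
            if Fraction(c, d) >= Fraction(a, b):
                assert m >= Fraction(a, b)
            else:
                assert m < Fraction(a, b)
    return True

print(check_mediant())
```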

Assume (III.1) holds. We take the logarithms of both sides of (III.1) and rearrange it to
obtain $\log f(X) - \max_{x \in X}\{\log f(x)\} - \log f(Y) + \max_{y \in Y}\{\log f(y)\} \ge (\min_{x \in X}\{\log f(x)\} - \min_{y \in Y}\{\log f(y)\})(|X| - 1) NWD(X)$.
Let the lefthand side of the inequality be c and the
righthand side of the inequality be d NWD(X). Then

$$
NWD(X) = \frac{\max_{x \in X}\{\log f(x)\} - \log f(X)}{(\log N - \min_{x \in X}\{\log f(x)\})(|X| - 1)}
\le \frac{\max_{y \in Y}\{\log f(y)\} - \log f(Y)}{(\log N - \min_{y \in Y}\{\log f(y)\})(|X| - 1)}
= \frac{|Y| - 1}{|X| - 1} NWD(Y). \tag{A.2}
$$

The inequality holds by the rewritten (III.1) and the a, b, c, d argument above since c/d ≥
NWD(X) = a/b.

Assume (III.1) does not hold, that is, it holds with its inequality sign replaced by a < sign. We take
logarithms of both sides of this last version and rewrite it to obtain
$\log f(X) - \max_{x \in X}\{\log f(x)\} - \log f(Y) + \max_{y \in Y}\{\log f(y)\} < (\min_{x \in X}\{\log f(x)\} - \min_{y \in Y}\{\log f(y)\})(|X| - 1) NWD(X)$.
Let the lefthand side of the inequality be c and the righthand side d NWD(X). Since
c/d < NWD(X) = a/b we have a/b > (a + c)/(b + d) by the a, b, c, d argument
above. Hence (A.2) holds with its inequality sign switched to a > sign. It remains to prove that
NWD(Y) ≥ NWD(Z)(|Z| - 1)/(|Y| - 1). This follows directly from item (i). ■


Proof of Lemma III.5:


The following is a counterexample. Let $X = \{x_1\}$, $Y = \{x_2\}$, $Z = \{x_3, x_4\}$,
$\max_{x \in XY}\{\log f(x)\} = 10$, $\max_{x \in XZ}\{\log f(x)\} = 10$, $\max_{x \in ZY}\{\log f(x)\} = 5$,
$\log f(XY) = \log f(XZ) = \log f(ZY) = 3$,
$\min_{x \in XY}\{\log f(x)\} = \min_{x \in XZ}\{\log f(x)\} = \min_{x \in ZY}\{\log f(x)\} = 4$,
and log N = 35. This arrangement can be realized for queries $x_1, x_2, x_3, x_4$. (As usual
we assume that $e(x_i) \ne e(x_j)$ for 1 ≤ i, j ≤ 4 and i ≠ j.) Computation shows
NWD(XY) > NWD(XZ) + NWD(ZY) since 7/31 > 7/62 + 2/62. ■
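
The arithmetic of the counterexample can be checked with exact fractions. The sketch below plugs the stated maxima, minima, joint values, and log N = 35 into the NWD quotient (numerator: maximum minus joint value; denominator: (log N minus minimum) times the set size minus one). The function name `nwd_from_logs` is ours. With the stated values the third term evaluates to (5 - 3)/62 = 1/31, and the strict inequality holds.

```python
from fractions import Fraction

def nwd_from_logs(max_log, joint_log, min_log, log_n, size):
    """NWD from logarithmic quantities: (max - joint) / ((log N - min)(size - 1))."""
    return Fraction(max_log - joint_log, (log_n - min_log) * (size - 1))

nwd_xy = nwd_from_logs(10, 3, 4, 35, 2)  # XY = {x1, x2}
nwd_xz = nwd_from_logs(10, 3, 4, 35, 3)  # XZ = {x1, x3, x4}
nwd_zy = nwd_from_logs(5, 3, 4, 35, 3)   # ZY = {x2, x3, x4}

print(nwd_xy, nwd_xz, nwd_zy, nwd_xy > nwd_xz + nwd_zy)
```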


Proof of Theorem III.7:


We start with the following:

Claim A.1. $EG_{\max}(X)$ is an admissible web distance function and $EG_{\max}(X) \le D(X)$ for
every computable admissible web distance function D.


Proof: Clearly $EG_{\max}(X)$ satisfies items (i) and (ii) of Definition III.6. To show it is an
admissible web distance it remains to establish the density requirement (iii). For fixed x consider
the sets X ∋ x with |X| ≥ 2. We have

$$
\sum_{X : X \ni x \,\&\, |X| \ge 2} 2^{-EG_{\max}(X)} \le 1
$$

since for every x the set $\{EG_{\max}(X) : X \ni x \,\&\, EG_{\max}(X) > 0\}$ is the length set of a binary
prefix code and therefore the summation above satisfies the Kraft inequality [9] given by (II.1).
Hence $EG_{\max}$ is an admissible distance.
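
The role of the Kraft inequality can be illustrated on a small prefix code: for a prefix-free set of binary code words, the sum of 2 to the power minus the word length is at most 1. The code words below are a hypothetical example and the function names are ours.

```python
from fractions import Fraction

def is_prefix_free(code_words):
    """True iff no code word is a prefix of another (checked on sorted neighbors)."""
    words = sorted(code_words)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(code_words):
    """Sum of 2^(-length) over the code words, as an exact fraction."""
    return sum(Fraction(1, 2 ** len(w)) for w in code_words)

code = ["0", "10", "110", "111"]  # hypothetical prefix code
print(is_prefix_free(code), kraft_sum(code))
```

Sorting suffices for the prefix test because, in lexicographic order, every word lying between a word and one of its extensions is itself an extension of that word.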

It remains to prove minorization. Let D be a computable admissible web distance, and let the
function f be defined by $f(X, x) = 2^{-D(X)}$ for x ∈ X and 0 otherwise. Since D is computable the
function f is computable. Given D, one can compute f and therefore K(f) ≤ K(D) + O(1).
Let $\mathbf{m}$ denote the universal distribution [12]. By [12, Theorem 4.3.2] $c_D \mathbf{m}(X|x) \ge f(X, x)$
with $c_D = 2^{K(f)} = 2^{K(D) + O(1)}$, that is, $c_D$ is a positive constant depending on D only. By
[12, Theorem 4.3.4] we have $-\log \mathbf{m}(X|x) = K(X|x) + O(1)$. Altogether, for every $X \in \mathcal{X}$
and for every x ∈ X it holds that $\log 1/f(X, x) \ge K(X|x) + \log 1/c_D + O(1)$. Hence
$D(X) \ge EG_{\max}(X) + \log 1/c_D + O(1)$. ■

By Lemma II.4 the function $EG_{\max}$ is upper semicomputable but not computable. The function
$G(X) - \min_{x \in X}\{G(x)\}$ is a computable and admissible function as in Definition III.6. By
Claim A.1 it is an upper bound on $EG_{\max}(X)$ and hence $EG_{\max}(X) \le G(X) - \min_{x \in X}\{G(x)\}$.
Every admissible property or feature that is common to all members of X is quantized
as an upper bound on $EG_{\max}(X)$. Thus, the closer $G(X) - \min_{x \in X}\{G(x)\}$ approximates
$EG_{\max}(X)$, the better it approximates the common admissible properties among all search
terms in X. This $G(X) - \min_{x \in X}\{G(x)\}$ is the numerator of NWD(X). The denominator
is $\max_{x \in X}\{G(x)\}(|X| - 1)$, a normalizing factor suited to the numerator of NWD(X). It is
chosen such that the quotient NWD(X) has a value in [0,1] (Lemma III.1).